Abstract

Background: Sequence features in promoter regions are involved in regulating gene transcription initiation. 
Although numerous computational methods have been developed for predicting transcriptional start sites (TSSs) or 
transcription factor (TF) binding sites (TFBSs), they do not annotate some important regulatory
features such as CpG islands, tandem repeats, the TATA box, CCAAT box, GC box, over-represented
oligonucleotides, DNA stability, and GC content. Additionally, the combinatorial interaction of TFs regulates
groups of genes that share the same expression pattern. To investigate gene transcriptional regulation, an
integrated system that annotates regulatory features in a promoter sequence and detects co-regulation of TFs in a 
group of genes is needed. 

Results: This work identifies TSSs and regulatory features in a promoter sequence, and recognizes co-occurrence of 
cis-regulatory elements in co-expressed genes using a novel system. Three well-known TSS prediction tools are
incorporated with orthologous conserved features, such as CpG islands, nucleotide composition, over-represented 
hexamer nucleotides, and DNA stability, to construct the novel Gene Promoter Miner (GPMiner) using a support 
vector machine (SVM). According to five-fold cross-validation results, the predictive sensitivity and specificity are 
both roughly 80%. The proposed system allows users to input a group of gene names/symbols, enabling the
co-occurrence of TFBSs to be determined. Additionally, an input sequence can be searched for homology against
experimental mammalian promoter sequences, and conserved regulatory features between homologous promoters
can be observed through cross-species analysis. After identifying promoter regions, regulatory features are 
visualized graphically to facilitate gene promoter observations. 

Conclusions: The GPMiner, which has a user-friendly input/output interface, has numerous benefits in analyzing 
human and mouse promoters. The proposed system is freely available at http://GPMiner.mbc.nctu.edu.tw/. 



Background

Gene transcription is regulated by transcription factors
(TFs) that bind specifically to promoter regions, which
are the crucial control regions for transcriptional activation
of all genes [1]. A typical promoter sequence, which is
located near the transcriptional start site (TSS), is
believed to comprise short DNA sequences known as
regulatory elements, including TF binding sites (TFBSs)
[2]. With the vast amount of available genomic data, an
increasing need exists for techniques that can rapidly
and accurately evaluate sequences for the presence of
promoters [3]. Furthermore, some important regulatory
motifs, such as the TATA box, CCAAT box, GC box,
[...] promoters, and restricting a promoter region from using
information from mRNA transcripts must be considered 
[4]. Additionally, some co-regulatory networks describe 
the set of all significant associations among TFs in regu- 
lating common target genes [5]. Accordingly, the combi- 
natorial interaction of TFs is critical in gene regulation. 

PlantPAN, a database-assisted system for recognizing
co-occurrence of cis-regulatory elements in plant
co-expressed genes [6], is effective for plant promoter
investigations. However, no similar resource exists for 
identifying co-occurrence TFBSs in a group of mamma- 
lian promoters. Veerla et al. recently developed SMART 
software for identifying co-occurring TFBSs in gene set 
promoters [7]. Nevertheless, this software does not have 
a user-friendly interface for identifying TSSs with regu- 
latory elements and efficiently analyzing combinatorial 
TFBSs of a group of promoters. COXPRESdb provides 
coexpressed gene networks and coexpressed gene lists 
ordered based on the strength of coexpression for 
humans and mice [8]. However, COXPRESdb does not 
analyze TFBSs in co-expressed gene promoters. 
Although TOUCAN is a Java application for identifying
significant cis-regulatory elements from sets of
co-expressed genes, it does not analyze combinatorial
TFBSs [9]. This work develops a novel system,
Gene Promoter Miner (GPMiner), for identifying co- 
occurring TFBSs in a group of gene promoters. 

However, the promoter region must be precisely iden- 
tified before identification of TFBSs co-occurrence. 
Many databases are useful in collecting numerous TSSs 
and have promoter prediction tools. The DBTSS is a 
TSS database established by gathering experimentally 
identified promoter regions via the oligo-capping 
method [10]. The Eukaryotic Promoter Database (EPD) 
is an annotated non-redundant collection of eukaryotic 
POL II promoters, for which the TSS has been deter- 
mined experimentally [11]. Various promoter prediction 
methods have been developed for analyzing gene promoter
regions (Table S1, additional file 1). The
CpGProD program identifies CpG islands in mammalian 
promoter regions [12]. The DragonGSF program pre- 
dicts gene promoters based on information of CpG 
islands, TSSs and downstream signals of predicted TSSs 
[13]. The NNPP2.2 program applies a time-delay neural 
network for promoter annotation of the Drosophila
melanogaster genome [14]. Eponine detects the
transcriptional initiation site near the TATA box, together
with flanking regions of GC enrichment [15].
McPromoter, a statistical method, identifies the
eukaryotic polymerase II TSS in genomic DNA [16-18].
FirstEF uses a set of discriminant functions that
can recognize both boundaries of the first exon [19].
The PromoSer method computationally identifies TSSs 
by considering the alignments of numerous partial and 



full-length mRNA sequences to those of genomic DNA 
[20]. The PromH scheme identifies promoters based on 
conservation of regulatory features in pairs of human/ 
mouse orthologous genes. Another regulatory feature of 
promoter regions, DNA stability, was investigated for 
analyzing prokaryotic promoters [21]. Notably, DNA 
stability is a structural property of the DNA duplex frag- 
ment. The minimum free energy of the DNA duplex is 
calculated based on hydrogen bonding of A-T and C-G 
pairs. Kanhere et al. demonstrated that DNA stability of 
promoter regions provides a much better clue than 
other features when determining the location of the TSS 
[21]. 

Although numerous computational methods have 
been developed for identifying promoters of genes in 
genomic sequences, their outcomes are not satisfactory, 
especially for promoters lacking a TATA box and CpG 
islands [1]. Furthermore, many methods have poor pre- 
dictive specificity, generating many false-positive predic- 
tions, or have poor sensitivity. Therefore, this work 
develops an integrated system, GPMiner, that identifies 
promoter regions with high predictive sensitivity and 
specificity. Moreover, GPMiner comprehensively anno- 
tates regulatory elements, including TFBSs, CpG islands, 
tandem repeats, the presence of a TATA box, CCAAT 
box, or GC box, statistically over-represented sequence 
patterns, GC content (GC%), and DNA stability. Addi- 
tionally, GPMiner accurately identifies combinatorial 
TFBSs in a group of gene promoters. 

Construction and content 

Figure 1 presents the GPMiner system flow, which iden- 
tifies promoter regions and annotates transcriptional 
regulatory features in a user-input genomic sequence. 
Computational models for promoter identification were 
constructed by incorporating the support vector 
machine (SVM) with nucleotide composition features, 
over-represented hexamer nucleotides, and DNA stabi- 
lity. Additionally, GPMiner allows users to input a 
group of genes for identification of co-occurring TFBSs 
in promoter sequences. All mined promoter regions and 
regulatory features in the user-input sequence are visua- 
lized graphically to facilitate analysis of gene transcrip- 
tional regulation. The details of the proposed method 
are as follows. 

Input genomic sequence 

Users first input a genomic sequence in the FASTA for- 
mat to identify promoter regions and to mine regulatory 
elements within the input sequence. The input sequence
is searched for homology against experimental mammalian
promoter sequences collected from the DBTSS
(version 6.0) [10], EPD (release 80) [11] and Ensembl
(version 61) [22]. All experimentally verified TSSs are 

Figure 1 System flow of GPMiner.

using genomic positional information provided by 
DBTSS and EPD. By default, the region from 2000 bp
upstream to 200 bp downstream of the TSS (+1) is
defined as the promoter region and extracted for a
sequence homology search.
Notably, GPMiner collects 22774, 25420, 22159, 22475, 
and 18201 known genes from five mammalian genomes, 
including the human, mouse, rat, chimpanzee, and dog 
genomes, respectively. After the sequence homology
search, the proposed system outputs a set of known 
genes with promoter sequences resembling the input 
sequence. Additionally, users can input the chromoso- 
mal location to specify sequence regions for mining reg- 
ulatory features. 
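
The default promoter window described above can be sketched as follows. This is a minimal illustration, not GPMiner's implementation: the function name, the 1-based inclusive coordinates, the convention that the TSS itself is position +1 (so the downstream part starts at the TSS), and the strand handling are all assumptions made for the example.

```python
def promoter_window(tss, strand="+", upstream=2000, downstream=200):
    """Return 1-based inclusive (start, end) genomic coordinates of a promoter.

    Assumed convention: the TSS is position +1, so the window holds
    `upstream` bases before the TSS plus `downstream` bases starting at it.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream - 1
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        start, end = tss - downstream + 1, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the chromosome start (hypothetical safeguard).
    return max(1, start), end


start, end = promoter_window(5000, "+")
print(start, end, end - start + 1)
```

With the defaults, each window spans 2200 bp unless it is clipped at the start of the chromosome.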

Promoter identification 

The GPMiner system uses an SVM that considers orthologously
conserved regulatory features, such as CpG
islands, nucleotide composition, over-represented hexamer
nucleotides, and DNA stability, of a promoter
sequence to identify mammalian proximal promoters
(Figure 2). Mammalian promoters are usually around
1000 bp long [23]. Because some regulatory
elements are located far from the TSS, numerous
cis-regulatory element annotation systems use 3000 bp
upstream as the maximum region for analysis [24].
Furthermore, several studies indicate that the region
downstream of the TSS plays a critical role during
transcription. Therefore, 3000 bp downstream of the
TSS are also selected for analysis. Consequently,
experimentally identified promoters originating
from human and mouse genomes collected from the 
DBTSS (Table S2, additional file 1) were mapped to 
Ensembl genomic positions, and flanking sequences of 
-3000 bps to +3000 bps around the mapped TSSs were 
selected. Furthermore, homologous promoter sequences 
between human and mouse genomes were analyzed 
using the BLAST program [25]. Homologous promoter
sequences with sequence identity exceeding 80% were
extracted and defined as training sequences. These
training sequences were classified into two subgroups
based on whether CpG islands were detected by
CpGProD [12]. Table S3 (in additional file 1) lists the
statistics of the classified training set. 

Figure 2 Analytical flowchart of promoter identification.

After constructing and classifying the training set, the
nucleotide composition of the training sequences is
analyzed first: the occurrence rates of mono-, di-, and
tri-mer nucleotides are calculated within a 20-bp window
sliding along the training sequences. Figure S1 (in
additional file 1) shows the average distributions of
occurrence rates of nucleotide compositions. Pearson's
correlation coefficient is calculated to cluster the average
distributions of mono-, di-, and tri-mer nucleotides into two
groups based on the two major distributions of adenine
and guanine (Table S4, additional file 1). Furthermore,
training sequences are also used to extract over-represented
6-mer nucleotides within a specified window
around the TSSs, which comprises the so-called positive
set. The occurrence probabilities of 6-mer nucleotides in
the specified window are calculated and compared to
background probabilities of the entire genome. By
optimizing the number (50-200) of over-represented 6-mer
nucleotides, the top 100 over-represented 6-mer
nucleotides are selected as training features.
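
The two feature types described above can be sketched as follows. This is an illustration under stated assumptions rather than the GPMiner code: the function names are hypothetical, the window slides in steps of 1 bp, and 6-mers are ranked by the ratio of their occurrence probability in the positive set to a pseudocount-smoothed background probability, which is only one possible way to "compare" the two probabilities.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def windowed_rates(seq, k, window=20):
    """Occurrence rate of each k-mer inside every 20-bp window (step 1)."""
    rates = []
    per_window = window - k + 1  # number of k-mer positions in one window
    for start in range(len(seq) - window + 1):
        counts = kmer_counts(seq[start:start + window], k)
        rates.append({kmer: n / per_window for kmer, n in counts.items()})
    return rates


def top_overrepresented(positive_seqs, background_seqs, k=6, top=100):
    """Rank k-mers by positive-set probability over background probability."""
    pos, bg = Counter(), Counter()
    for s in positive_seqs:
        pos.update(kmer_counts(s, k))
    for s in background_seqs:
        bg.update(kmer_counts(s, k))
    pos_total, bg_total = sum(pos.values()), sum(bg.values())
    ratios = {}
    for kmer, n in pos.items():
        # Pseudocount (assumption) keeps unseen background k-mers finite.
        bg_p = (bg[kmer] + 1) / (bg_total + 4 ** k)
        ratios[kmer] = (n / pos_total) / bg_p
    return sorted(ratios, key=ratios.get, reverse=True)[:top]


print(top_overrepresented(["TATAAATATAAA"], ["GCGCGCGCGCGC"], k=6, top=1))
```

In practice the positive sequences would be the windows around the TSSs and the background would be genome-wide counts; the toy strings above only exercise the functions.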

Furthermore, DNA stability is a feature used for
identifying promoter sequences. SantaLucia et al. [26] used
the unified standard free energies of ten dinucleotide
duplexes (AA/TT, AT/TA, TA/AT, CA/GT, GT/CA,
CT/GA, GA/CT, CG/GC, GC/CG, and GG/CC;
Table S5, additional file 1) to calculate the standard
free energy change of a DNA oligonucleotide based on
its dinucleotide composition. This work applied the
equation of standard free energy change to determine the
stability of a DNA duplex with a window size of 15 nt
sliding from -3000 to +3000 relative to the TSSs
in training sequences. Figure S2 (in additional file 1)
shows distributions of average free energy of DNA
duplex formation. Near the TSS, a peak exists in the
region from -30 to -10, which corresponds to
the TATA box in eukaryotic promoter sequences.
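
The sliding-window stability calculation can be sketched as below. The free energy values are the unified nearest-neighbour parameters (kcal/mol, 37 °C) as commonly quoted from SantaLucia's unified table; they are entered here as an assumption and should be checked against Table S5. The sketch sums dinucleotide terms only and omits initiation and end corrections, which is a simplification; the function name is hypothetical.

```python
# Nearest-neighbour free energies (kcal/mol); each duplex step is listed
# under both strand readings (e.g. CA/GT is read as "CA" or "TG").
# Values are an assumption taken from the commonly cited unified table.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def sliding_free_energy(seq, window=15):
    """Sum of dinucleotide free energies for every window (step 1 nt).

    Windows containing a base other than A, C, G, T yield None.
    """
    seq = seq.upper()
    energies = []
    for start in range(len(seq) - window + 1):
        fragment = seq[start:start + window]
        steps = [fragment[i:i + 2] for i in range(window - 1)]
        if all(step in NN_DG for step in steps):
            energies.append(sum(NN_DG[step] for step in steps))
        else:
            energies.append(None)
    return energies


profile = sliding_free_energy("GCGCGCGCTATAAAAGGCGCGCGC")
print([round(e, 2) for e in profile])
```

More negative values indicate a more stable duplex, so an AT-rich stretch such as a TATA-like motif shows up as a less negative region in the profile.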

The public SVM library LIBSVM [27] is used to
construct predictive models. The SVM kernel function is
set to the radial basis function (RBF). Before
extracted regulatory features are used to train SVM models,
the window of the proximal promoter region,
which comprises the so-called positive set, must be
defined. Therefore, five window sizes (-60 to +20, -100 to
+50, -200 to +100, -300 to +150, and -400 to +200) are
defined, and a benchmark is applied to evaluate the
predictive performance for each. The
benchmark, namely cross-validation, extracts equal numbers
of sequences from the positive and negative sets, constructs
the SVM model, and evaluates the model with k-fold
cross-validation. Training sequences within the specified
window are defined as the positive set; regions outside
the specified window, with the same window size as
the positive set, are chosen randomly as the
negative set.

Predictive performance of the constructed models is
evaluated by five-fold cross-validation [28]. Each dataset
is split into five approximately equally sized subgroups.
During cross-validation, each subgroup is used as the
validation set in turn, and the remaining subgroups
comprise the training set. The measures of predictive
performance of trained models are Precision (Prec) =
TP/(TP+FP), Sensitivity (Sn) = TP/(TP+FN), Specificity
(Sp) = TN/(TN+FP), and Accuracy (Acc) =
(TP+TN)/(TP+FP+TN+FN), where TP, TN, FP, and FN are
the numbers of true positive, true negative, false positive,
and false negative predictions, respectively. SVM models
trained with the three different regulatory features are
evaluated, and models with the best predictive accuracy
are selected for mammalian proximal promoter prediction.
Moreover, several promoter prediction tools, NNPP2.2 [14],
Eponine [15] and McPromoter [16], are integrated into
GPMiner to provide additional information about the
proximal promoter, thereby improving predictive
specificity.
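
The evaluation scheme above can be sketched in a few lines. The four measures follow the definitions given in the text; the fold assignment (interleaved, without shuffling) and the function names are assumptions for illustration, and the SVM training itself (done with LIBSVM in this work) is not shown.

```python
def performance(tp, tn, fp, fn):
    """Precision, sensitivity, specificity and accuracy from raw counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def k_fold_indices(n, k=5):
    """Yield (train, validation) index lists; each fold validates once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation


for train, validation in k_fold_indices(10):
    print(len(train), validation)

print(performance(tp=80, tn=79, fp=21, fn=20))
```

In a real run, a model would be trained on each `train` subset and its predictions on `validation` accumulated into TP, TN, FP and FN before calling `performance`.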

Mining cis-regulatory features

After identifying proximal promoter regions, regulatory
elements involved in gene transcriptional regulation, such
as transcription factor binding sites, CpG islands, the
TATA box, CCAAT box, GC box, and over-represented
sequences, are annotated. Furthermore, tandem repeats,
DNA stability, and GC content in the promoter
region are provided for advanced analysis of gene
transcriptional regulation. Table 1 shows the databases
and tools integrated in GPMiner for mining regulatory
elements within input sequences. For instance, MATCH
[29] is utilized for scanning TFBSs in an input
sequence using the TF binding profiles from TRANSFAC
public release version 7.0 [30] and JASPAR [31].
The CpGProD program [12] is applied to detect
CpG islands in a promoter region with a prediction
specificity of roughly 70%. A tandem repeat finder [32] is
applied to identify tandem repeats in promoter
sequences. In detecting TFBSs in promoter regions,
cutoff values of the core and matrix scores of the MATCH
program are set to 1.0 and 0.7, respectively. In particular,
frequent regulatory elements, such as the TATA box,
CCAAT box, and GC box, are represented separately.
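
The effect of the two MATCH cutoffs can be illustrated with a small filter. The record layout (`tf`, `pos`, `core`, `matrix`) and the example hits are hypothetical and do not reproduce MATCH's actual output format; the sketch only assumes that a hit is kept when both scores reach their cutoffs.

```python
def filter_hits(hits, core_cutoff=1.0, matrix_cutoff=0.7):
    """Keep predicted binding sites whose scores reach both cutoffs."""
    return [
        h for h in hits
        if h["core"] >= core_cutoff and h["matrix"] >= matrix_cutoff
    ]


# Toy records, invented for the example.
hits = [
    {"tf": "TF_A", "pos": 120, "core": 1.00, "matrix": 0.95},
    {"tf": "TF_B", "pos": 340, "core": 0.98, "matrix": 0.99},
    {"tf": "TF_C", "pos": 515, "core": 1.00, "matrix": 0.65},
]
print([h["tf"] for h in filter_hits(hits)])
```

Raising the matrix cutoff trades sensitivity for fewer false-positive sites.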

Several important regulatory features, such as repeats
and over-represented oligonucleotides, are integrated.
Repeats, such as tandem repeats, Alu, and L1 elements,
can alter the methylation distribution in a genome,
and possibly gene transcription [33,34]. The proposed
system applies a statistical method to identify
over-represented oligonucleotides (6-12 bps) in promoter
regions; these over-represented oligonucleotides are
identified by comparing their occurrence frequencies in
promoter regions with their background occurrence
frequencies throughout the whole genome (see additional
file 1 for a detailed description). Based on statistical
significance, oligonucleotides with a Z-score > 5 are
chosen as over-represented (OR) sequences. Moreover, DNA stability
Table 1 Supported regulatory features in GPMiner

Regulatory features | Integrated database or tools | Descriptions
Transcriptional start site | NNPP2.2 [14] | Applying a time-delay neural network for promoter annotation
Transcriptional start site | McPromoter [16] | Using a statistical method to identify eukaryotic polymerase II TSS in genomic DNA
Transcriptional start site | Eponine [15] | Predicting the transcription start site for a DNA sequence with prediction specificity > 70%
Transcription factor (TF) binding site | TRANSFAC public release 7.0 [30] | Storing the experimentally verified transcription factors, their genomic binding sites and DNA-binding profiles
Transcription factor (TF) binding site | MATCH [29] | Scanning the transcription factor binding site using the transcription factor binding profiles from TRANSFAC public release 7.0 and JASPAR
CpG island | CpGProD [12] | Detecting the CpG island
Repeats | TRF [32] | A tandem repeat finder
TATA box, CCAAT box, and GC box | MATCH [29] | Scanning the TATA-, CCAAT- and GC-box by the transcription factor binding profiles from TRANSFAC
TATA box, CCAAT box, and GC box | Narang et al. [47] | Defining the 6-mer pattern of the TATA box, CCAAT box, and GC box with positional density
Over-represented pattern | Huang et al. [48] | Defining the statistically significant pattern in the promoter region
DNA stability | Aditi Kanhere et al. [21] | Predicting the DNA stability of the promoter region
Co-occurrence of TF binding sites | apriori [35] | A method to mine the association rules
Conserved regions between homologous gene promoter sequences | Blast [25] | Using the blast program to analyze the conserved region between the homologous gene promoter sequences


distributions are provided. GC content is also
calculated using a window size of 15 nt and serves as a
reference for identification of CpG islands.
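
The Z-score criterion for over-represented oligonucleotides mentioned in this section can be sketched as follows. The exact statistic used by GPMiner is described in additional file 1 and is not reproduced here; the sketch assumes a binomial model in which an oligonucleotide with background frequency p is expected N·p times among N promoter positions. The function names and the toy numbers are hypothetical.

```python
from math import sqrt


def z_score(observed, n_positions, background_freq):
    """Z-score of an observed count under a binomial background model."""
    expected = n_positions * background_freq
    sd = sqrt(n_positions * background_freq * (1.0 - background_freq))
    return (observed - expected) / sd


def overrepresented(counts, n_positions, background, threshold=5.0):
    """Return oligonucleotides whose Z-score exceeds the threshold."""
    selected = {}
    for oligo, observed in counts.items():
        z = z_score(observed, n_positions, background[oligo])
        if z > threshold:
            selected[oligo] = z
    return selected


# Toy numbers, invented for the example.
counts = {"TATAAA": 30, "ACGTAC": 12}
background = {"TATAAA": 0.01, "ACGTAC": 0.01}
print(overrepresented(counts, n_positions=1000, background=background))
```

Under these assumptions, 30 occurrences where 10 are expected gives a Z-score above 5, whereas 12 occurrences does not.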

Identifying co-occurrence of TFBSs in a group of gene 
promoters 

GPMiner allows users to input a
group of genes to mine co-occurrence of TFBSs in promoter
regions. An association rule mining method,
namely Apriori [35], is applied to mine the co-occurrence
of TFBSs in a group of gene promoter sequences.
Consider a large database of transactions, in which
each transaction consists of a set of items. An association
rule is an expression such as A → B, where A and
B are item sets. The rule states that a transaction
in the database containing A tends to also
contain B. For example, 90% of people who purchase
beer also purchase diapers. Herein, 90% is the rule
confidence. The support of the A → B rule is the percentage
of transactions containing both A and B.

The formal problem statement is as follows. Let S =
{s1, s2, ..., sm} be a set of known TFBSs of the human
genome. The union of members in the set S is called
the item set I. Let G = {g1, g2, ..., gm} be a group of genes
with differential expression in a specific tissue. Each
promoter region of a gene is mapped to a transaction
containing a set of known regulatory sites, also called
items. A promoter region S contains A, a set
of items of I, when A ⊆ S. An association rule is an
implication of the form A → B, where A ⊂ I, B ⊂ I,
and A ∩ B = ∅. The A → B rule holds in the set of
promoter regions D with confidence c when c% of
the promoter regions in D that contain A also contain B.
The A → B rule has support s in the set of promoter
regions D when s% of promoter regions in D contain
A ∪ B. The association rules, the so-called co-occurrences
of TFBSs, are generated when a rule has higher support
and confidence than those specified by a user.
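
Support and confidence as defined above can be computed directly when each promoter is represented as a set of TFBS names. The sketch below enumerates pairs by brute force rather than using the level-wise candidate pruning of Apriori, so it only illustrates the two measures; the function names, gene labels and TF labels are placeholders.

```python
from itertools import combinations


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, a, b):
    """Fraction of transactions containing A that also contain B."""
    sup_a = support(transactions, a)
    if sup_a == 0:
        return 0.0
    return support(transactions, set(a) | set(b)) / sup_a


def pair_rules(promoter_tfbs, min_support=0.9, min_confidence=0.9):
    """Return (A, B, support, confidence) for single-item rules A -> B."""
    transactions = [frozenset(sites) for sites in promoter_tfbs.values()]
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        sup = support(transactions, {a, b})
        if sup < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            conf = confidence(transactions, {x}, {y})
            if conf >= min_confidence:
                rules.append((x, y, sup, conf))
    return rules


# Placeholder data: four promoters and the TFBSs predicted in each.
promoters = {
    "gene1": {"TF_A", "TF_B", "TF_C"},
    "gene2": {"TF_A", "TF_B"},
    "gene3": {"TF_A", "TF_B"},
    "gene4": {"TF_A", "TF_B"},
}
print(pair_rules(promoters))
```

Here only the pair TF_A and TF_B passes both 90% thresholds, since TF_C occurs in a single promoter.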

After mining co-occurrences (combinations) of TFBSs
in a group of gene promoter sequences, the statistical
significance of each combination must be examined against
the background set of genes using the hypergeometric
model:

$$P = \frac{\binom{k}{t}\binom{K-k}{T-t}}{\binom{K}{T}}$$

where K is the number of background gene promoters
used, T is the number of observed gene promoters input
by users, k is the number of promoters that have the
combination in the background gene set, and t is the
number of promoters that have the combination in the
observed gene set. The P-value is calculated for each
combination based on the hypergeometric equation; as the
P-value decreases, statistical significance increases.
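
A minimal sketch of the hypergeometric calculation, using the symbols K, T, k and t as defined above. It assumes the standard point probability C(k,t)·C(K−k,T−t)/C(K,T); the upper-tail sum, a common choice for enrichment P-values, is included as an alternative because the text does not state which of the two GPMiner reports. The function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(K, T, k, t):
    """Probability of exactly t hits among T promoters drawn from K,
    of which k carry the TFBS combination."""
    return comb(k, t) * comb(K - k, T - t) / comb(K, T)


def hypergeom_tail(K, T, k, t):
    """Probability of t or more hits (upper tail)."""
    return sum(hypergeom_pmf(K, T, k, i) for i in range(t, min(k, T) + 1))


# Toy example: 5 of 10 background promoters carry the combination,
# and all 4 input promoters carry it.
print(hypergeom_pmf(10, 4, 5, 4))
print(hypergeom_tail(10, 4, 5, 3))
```

For realistic background sizes the binomial coefficients become very large; Python integers handle this exactly, at some cost in speed.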

Graphical visualization 

After mining proximal promoter regions and regulatory 
features, all mined regulatory features are presented
graphically in the web interface, which is constructed using
the GD library and the PHP programming language. To
simplify graphical visualization, regulatory features with 
numerous entries are presented initially in an overview 
form. Regulatory features are displayed in detail when 
users click the "detailed view" button. Additionally, 
detailed information of regulatory features is listed in 
tabular form. The co-occurrences of TFBSs in a set of 
gene promoter sequences are also represented graphi- 
cally. When users investigate promoters of known 
genes, conserved regions of homologous gene promoters 
are displayed graphically, as are regulatory features 
found in conserved promoter regions. The graphical 
visualization of regulatory elements facilitates analysis of 
gene transcription regulation. 

Utilities and discussion 

Performance of promoter identification 

A benchmark, namely cross-validation, is used to evaluate
the predictive performance of GPMiner, which
incorporates an SVM with nucleotide composition,
over-represented hexamer nucleotides, and DNA stability
for mammalian proximal promoter identification.
The benchmark extracts positive and negative sets of
equal size, constructs the SVM model,
and evaluates the model with k-fold cross-validation (k =
5). Table S6 (in additional file 1) lists the prediction
performance of the constructed SVM models trained with
three different regulatory features based on the five
window sizes. Training sequences are classified into
two subgroups, with and without CpG islands; the
predictive performance of the group with CpG islands is
markedly higher than that of the group without CpG
islands. Furthermore, as window size
increases, the prediction performance of SVM models
increases. However, after considering both prediction
performance and window size, a window of -200 to
+100 is selected as the specified window for identifying
proximal promoter regions. Vertebrate gene expression
is frequently regulated by the proximal promoter, which
is traditionally defined as the region between -200 bp and
the TSS [36].

Table 2 lists the predictive performance of SVM models
trained with combinations of the three different
regulatory features, namely over-represented hexamer
nucleotides (OR), nucleotide composition (NC), and
DNA stability (DS). Three training sets, "all", with CpG
islands, and without CpG islands, are evaluated by
benchmark cross-validation based on the specified
window of -200 to +100 relative to the TSS (+1). In
all three training sets, the combination OR+NC+DS
achieves the highest accuracy. Moreover, the
training set with CpG islands, which
achieves a predictive accuracy of 82%, performs better
than the "all" and without-CpG-island training sets. Both
SVM models, trained with and without CpG
islands respectively, are used for proximal
promoter identification. For an input sequence, the
presence of a CpG island is detected first, and the
sequence is then classified by the SVM model with
CpG islands or the SVM model without CpG islands accordingly.

Notably, GPMiner lets users input a novel sequence to
annotate the proximal promoter region with the putative
TSS. Thus, 1871 human promoter sequences (from
-3000 to +3000) in the EPD comprise the independent
test set used to evaluate predictive performance.
Regions of the test sequences within -200 to +100
relative to the TSSs (+1) are defined as the positive set;
the negative set is extracted randomly from



Table 2 The prediction performance of SVM models with combinations of three kinds of regulatory features, over-represented
hexamer nucleotides (OR), nucleotide composition (NC), and DNA stability (DS), evaluated by
benchmark "Cross-validation" based on the specified window size -200 to +100 of TSS(+1).

Training set | Window size | Features | Precision | Sensitivity | Specificity | Accuracy
All (6,452) | -200 to +100 | OR+NC | 77% | 71% | 79% | 75%
All (6,452) | -200 to +100 | OR+DS | 76% | 69% | 78% | 74%
All (6,452) | -200 to +100 | NC+DS | 75% | 74% | 76% | 75%
All (6,452) | -200 to +100 | OR+NC+DS | 79% | 76% | 79% | 78%
With CpG (4,898) | -200 to +100 | OR+NC | 79% | 81% | 79% | 80%
With CpG (4,898) | -200 to +100 | OR+DS | 77% | 80% | 76% | 78%
With CpG (4,898) | -200 to +100 | NC+DS | 77% | 82% | 75% | 78%
With CpG (4,898) | -200 to +100 | OR+NC+DS | 80% | 84% | 79% | 82%
Without CpG (1,554) | -200 to +100 | OR+NC | 68% | 70% | 67% | 68%
Without CpG (1,554) | -200 to +100 | OR+DS | 68% | 71% | 66% | 68%
Without CpG (1,554) | -200 to +100 | NC+DS | 66% | 67% | 66% | 66%
Without CpG (1,554) | -200 to +100 | OR+NC+DS | 69% | 69% | 71% | 70%

The number of training sequences used to construct the SVM models is shown in parentheses in the column "Training set".



Lee et al. BMC Genomics 2012, 13(Suppl 1):S3
http://www.biomedcentral.com/1471-2164/13/S1/S3



regions other than those in the positive set. Table S7 (in 
additional file 1) compares the predictive performance 
of GPMiner and those of NNPP2.2, Eponine, and 
McPromoter. Furthermore, Figure S3 (in additional file 
1) shows the distribution of promoter predictions of 
GPMiner, NNPP2.2, Eponine, and McPromoter. The 
sensitivity of GPMiner is better than that of the other
methods; however, the predictive specificity of McPromoter
and Eponine is better than that of GPMiner. Because of
their high specificity, NNPP2.2, Eponine, and
McPromoter are integrated to reduce the number of
false-positive predictions.

Web interface

The GPMiner system has two primary functions. First, 
"gene group analysis" is adopted to identify co-occur- 
rence of TFBSs in a group of gene promoters. Combina- 
torial regulation by TF complexes is an important 
feature of eukaryotic gene regulation [5,37,38]. Second, 
"promoter analysis" can be employed to analyze TFBSs, 
CpG islands, tandem repeats, the presence of a TATA 
box, CCAAT box, or GC box, statistically over-repre- 
sented sequence patterns, GC content (GC%) and DNA 
stability in the promoter sequence of a given gene ID or 
a novel promoter sequence. Furthermore, cross-species 
analysis of homologous gene promoters is performed by 
GPMiner, such that conserved regulatory features in 
promoter regions can also be observed. 

Figure 3 shows the web interfaces of GPMiner. In the 
submission interface, users first choose one of five
mammals (human, mouse, rat, chimpanzee, or dog),
and input a genomic sequence or chromosomal location 
for identifying proximal promoter regions and for 
mining regulatory features. Eight regulatory features cur- 
rently exist in GPMiner. By default, all regulatory fea- 
tures are chosen for annotation in the input sequence. 
Notably, users can input a chromosome location to spe- 
cify regions of interest for retrieving genes located in 
this chromosome region. During the mining process, the 
proposed system uses various tools individually to anno- 
tate different regulatory features in an input sequence. 
Each annotating tool for regulatory features has some 
search parameters, such as score threshold in NNPP2.2, 
Eponine, and McPromoter, the core score and matrix 
score for the MATCH program, Z-score for over-repre- 
sented oligonucleotides, and support and confidence 
scores for co-occurrence TFBSs analysis, in a gene 
group search. Default parameters for these tools are set 
and the related documentation is shown on the help 
webpage. After mining regulatory features, a graphical 
visualization of identified promoter regions and mined 
regulatory features is provided to users. Figures S4 and 
S5 (see additional file 1) present graphical representa- 
tions of regulatory elements for known gene promoter 
and homologous promoter sequences, respectively. 



Case studies 

Figure 4 shows an example gene group analysis.
NF-kappaB is a well-known inducible TF that controls
kinetically complex patterns of gene expression in multiple
pathways in human. In a previous study, ATM,
EP300, FGFB1, and SFN were regulated by NF-kappaB
and co-regulated by the Ets TF in the progression of
various cancers [39]. To demonstrate GPMiner, these four
gene names were input for gene group analysis
to detect co-occurring TFBSs. The thresholds
of the core score and matrix score in TFBS scanning
were 1.0 and 0.9, respectively, and the support and
confidence values in co-occurrence analysis were both
set to 90%. NF-kappaB and Ets are identified
as combinatorial TFs in these four gene promoters
after three analytical steps by GPMiner. This
result is consistent with known regulatory pathways
[39]. Therefore, in this example GPMiner correctly
identifies co-occurring TFBSs in a set of gene promoters.
The proposed system can be
applied to analyze co-regulation in microarray
gene-expression databases such as COXPRESdb [40] and
Genevestigator [41]. The proposed GPMiner system
improves our understanding of transcription regulatory
networks in mammals.

Moreover, to demonstrate the application of single
promoter analysis, a case study of a human gene is
described below. The v-fos FBJ murine osteosarcoma
viral oncogene homolog gene (gene symbol FOS) is a
regulator of cell proliferation, differentiation, and
transformation [42]. According to experimentally verified
annotation in the Entrez Gene database, the FOS gene is
regulated by numerous transcription factors such as
SP1, SRF, SAP-1, and AP-1. Additionally, the FOS gene
exhibited DNA methylation based on information in the
Gene Ontology database. The FOS gene promoter
sequence was extracted and input into GPMiner to
mine the proximal promoter region and annotate
regulatory elements. The DNA stability of the input
sequence is graphically represented and the proximal
promoter region is highlighted (Figure S2, additional file
1). Using the TSS prediction tool Eponine, potential
TSSs are located near positions 500 and 2000 bp. The
CpG islands were annotated, as were numerous TFs
that may regulate the FOS gene promoter, including
SP1, SRF, SAP-1, and AP-1. Moreover, the TATA box
was annotated near position 2000 bp. Taken together,
the annotated regulatory features suggest that the
proximal promoter region is located near position 2000 bp,
consistent with the experimentally validated TSS of the
FOS gene at position 2001 bp.

Conclusions 

The GPMiner system has a gene group analysis function
for analyzing the co-occurrence of TFBSs with statistical
measures in a set of co-expressed genes. This function
provides a practical platform for examining co-expressed
genes from microarray data in transcriptional regulation
networks. Furthermore, the GPMiner system has a
user-friendly input/output interface, and has numerous
advantages in mammalian promoter analysis. The
proposed system incorporates an SVM with nucleotide
composition, over-represented hexamer nucleotides, and
DNA stability for mammalian proximal promoter
identification and mines regulatory elements, including TSSs,
TFBSs, CpG islands, tandem repeats, the TATA box,
CCAAT box, GC box, statistically over-represented
sequence patterns, GC content (GC%) and DNA stability.
Evaluated by benchmark cross-validation, the
predictive sensitivity and specificity of GPMiner are roughly
80%. All mined promoter regions and regulatory
features in the user input sequence are graphically
visualized to facilitate gene transcription analysis. Table
3 compares the functions of several representative
programs for promoter annotation with those of GPMiner.

Figure 3 The submission and result interface of GPMiner.

Figure 4 Gene group analysis in GPMiner.

The Functional Annotation of the Mouse 3 (FANTOM3)
[43] provides comprehensive experimentally
identified TSSs of human and mouse genomes by cap
analysis of gene expression (CAGE) [44]. The
comprehensive TSSs of CAGE may be used for further
promoter analysis. In addition to DNA stability, several
structural properties of the DNA duplex in the promoter
region, such as DNA curvature and bendability [45],
should be analyzed and applied to identify gene
promoter regions in mammals. Future versions of
GPMiner will include detailed information about gene 
regulation such as microarray gene-expression profiles. 
The GPMiner system will be maintained and updated 
continuously. 

Availability 

The GPMiner web server will be continuously main- 
tained and updated. The web server is now freely avail- 
able at http://GPMiner.mbc.nctu.edu.tw/. 

Additional material 



Additional file 1: Additional figures and tables. Contains additional 
figures and tables showing further results in the study. 